t-SNE overview

Basic information

  • t-SNE: t-distributed stochastic neighbor embedding

  • Method for visualizing high-dimensional data by giving each datapoint a location in a two or three-dimensional map

  • Computational complexity of direct implementation: \(O(N^{2})\)

  • Original paper: “Visualizing Data using t-SNE” - Laurens van der Maaten, Geoffrey Hinton, 2008

t-SNE is based on Stochastic Neighbor Embedding (SNE), originally developed by Sam Roweis and Geoffrey Hinton; Laurens van der Maaten proposed the t-distributed variant. It is a nonlinear dimensionality reduction technique well suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that, with high probability, similar objects are modeled by nearby points and dissimilar objects by distant points.

t-SNE summary
Advantages                             Disadvantages
Handles non-linear data efficiently    Computationally complex
Preserves local and global structure   Non-deterministic
                                       Requires hyperparameter tuning

Example data

Let’s assume that we have 5 datapoints (“A”, “B”, “C”, “D”, “E”) embedded in 3-dimensional space. Our objective is to embed these points in 2-dimensional space so that the local structure is preserved.

Example data
x y z
A 1.00 0.90 0.10
B 1.10 1.00 0.20
C 1.05 1.05 -0.10
D 1.50 1.02 0.10
E 1.40 1.01 0.15

Step 1 - calculation of distances between the points

First, let’s calculate the Euclidean distances between the points in the high-dimensional space, i.e.:

\[ d(x,y) = \sqrt{ \sum_{i=1}^{n} (x_{i} - y_{i})^{2} }\]

Distances between points
A B C D E
A 0.000 0.173 0.255 0.514 0.418
B 0.173 0.000 0.308 0.413 0.304
C 0.255 0.308 0.000 0.493 0.432
D 0.514 0.413 0.493 0.000 0.112
E 0.418 0.304 0.432 0.112 0.000
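The distance matrix above can be reproduced in a few lines of NumPy (a sketch; the coordinates are taken from the example-data table):

```python
import numpy as np

# Example datapoints A-E from the table above (rows = points, columns = x, y, z)
X = np.array([
    [1.00, 0.90,  0.10],   # A
    [1.10, 1.00,  0.20],   # B
    [1.05, 1.05, -0.10],   # C
    [1.50, 1.02,  0.10],   # D
    [1.40, 1.01,  0.15],   # E
])

# Pairwise Euclidean distances via broadcasting:
# D[i, j] = sqrt(sum_k (X[i, k] - X[j, k])^2)
D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1))

print(np.round(D, 3))  # matches the table, e.g. D[0, 1] = 0.173 (A-B)
```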

Step 2 - calculation of probability distribution in high-dim space

For each point(!) in our dataset we calculate its similarity to the other datapoints, which is described in the original paper as follows: the similarity of datapoint \(x_{j}\) to datapoint \(x_{i}\) is the conditional probability \(p_{j|i}\) that \(x_{i}\) would pick \(x_{j}\) as its neighbor. Note that this definition is not symmetric, i.e. \(p_{j|i} \neq p_{i|j}\).

Let’s pick one point, e.g. “A” and calculate the aforementioned conditional probabilities with respect to point “A”:

x
Dist A-A 0.000
Dist A-B 0.173
Dist A-C 0.255
Dist A-D 0.514
Dist A-E 0.418

We map the distances onto a Gaussian distribution with mean 0. The probability that point “A” will pick point “B” as its neighbor is proportional to the density of the normal distribution \(N(0, \sigma_{A})\) at “Dist A-B”.

Additionally, we assume that \(p_{i|i} = 0\) by definition. After we calculate the densities for all points, we normalize the values in order to obtain a probability distribution. Overall, the probability that point \(i\) will pick point \(j\) is given by the formula:

\[ p_{j|i} := \frac{\exp(-|| x_{i} - x_{j} ||^{2} / (2 \sigma_{i}^{2}) )}{\sum_{k \neq i} \exp(-|| x_{i} - x_{k} ||^{2} / (2 \sigma_{i}^{2}) )} \]

The remaining issue is how to determine the \(\sigma_{i}\) values of the normal distributions; in t-SNE each \(\sigma_{i}\) is chosen so that the conditional distribution \(p_{\cdot|i}\) has a fixed, user-chosen perplexity (cf. Appendix 1).
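The usual approach is a binary search over \(\sigma_{i}\) until the entropy of the conditional distribution matches the target perplexity. A minimal sketch (`find_sigma` is a hypothetical helper written for this note, not code from the original paper):

```python
import numpy as np

def find_sigma(dists, perplexity=2.0, tol=1e-5, max_iter=100):
    """Binary-search sigma_i so that the perplexity 2^H of p_{.|i}
    matches the target. `dists` holds the distances from point i to
    all other points (hypothetical helper, illustrative only)."""
    lo, hi = 1e-10, 1e10
    for _ in range(max_iter):
        sigma = (lo + hi) / 2
        p = np.exp(-dists ** 2 / (2 * sigma ** 2))
        p /= p.sum()
        H = -(p * np.log2(p + 1e-12)).sum()  # Shannon entropy in bits
        if abs(2 ** H - perplexity) < tol:
            break
        if 2 ** H > perplexity:
            hi = sigma   # distribution too flat: decrease sigma
        else:
            lo = sigma   # distribution too peaked: increase sigma
    return sigma

# Distances from point "A" to B, C, D, E (from the table above)
sigma_A = find_sigma(np.array([0.173, 0.255, 0.514, 0.418]), perplexity=2.0)
```

Larger \(\sigma_{i}\) flattens the distribution (higher perplexity), smaller \(\sigma_{i}\) sharpens it, so the search is monotone and converges quickly.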

Point “A” in high-dim space
Point A Point B Point C Point D Point E
distance (euclidean) 0 0.173 0.255 0.514 0.418
probability 0 0.752 0.701 0.470 0.563
norm. probability 0 0.302 0.282 0.189 0.226

Now we have the similarity of datapoint “A” to the other datapoints, defined as the conditional probability \(p_{X|A}\) (“A” would pick \(X\) as its neighbor):

\(P_{A}:\)
\(p_{B|A} = 0.302\)
\(p_{C|A} = 0.282\)
\(p_{D|A} = 0.189\)
\(p_{E|A} = 0.226\)

Let’s calculate all 5 distributions (one distribution per datapoint):

Conditional probability distributions (high-dim)
Point A Point B Point C Point D Point E
P_A 0.000 0.302 0.282 0.189 0.226
P_B 0.285 0.000 0.250 0.215 0.251
P_C 0.292 0.275 0.000 0.204 0.229
P_D 0.204 0.246 0.213 0.000 0.337
P_E 0.221 0.260 0.215 0.305 0.000
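The whole table can be reproduced as follows (a sketch; a fixed bandwidth σ = 0.5 for every point is assumed here because it matches the tabulated values, whereas real t-SNE tunes each σᵢ via the perplexity):

```python
import numpy as np

X = np.array([
    [1.00, 0.90,  0.10],   # A
    [1.10, 1.00,  0.20],   # B
    [1.05, 1.05, -0.10],   # C
    [1.50, 1.02,  0.10],   # D
    [1.40, 1.01,  0.15],   # E
])
sigma = 0.5  # assumed fixed bandwidth (reproduces the table values)

# Squared pairwise distances, Gaussian kernel, p_{i|i} = 0, row normalization
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
P = np.exp(-D2 / (2 * sigma ** 2))
np.fill_diagonal(P, 0.0)
P /= P.sum(axis=1, keepdims=True)

print(np.round(P, 3))  # row i holds the distribution P_i
```

The density normalization constant of the Gaussian cancels in the row normalization, so only the exponentials matter.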

Step 3 - calculation of probability distribution in low-dim space

Let’s do exactly the same thing in low-dimensional space by performing the following steps:

  • we randomly(!) select the coordinates of all datapoints in the low-dimensional space
  • instead of the normal distribution we apply the Student’s t-distribution (with one degree of freedom)

Point “A” in low-dim space
Point A Point B Point C Point D Point E
distance (euclidean) 0 0.130 0.366 0.248 0.227
probability 0 0.771 0.610 0.706 0.720
norm. probability 0 0.275 0.217 0.252 0.257

Overall, the probability that point \(i\) will pick point \(j\) in the low-dimensional space is given by the Student’s t-kernel (with one degree of freedom, whose heavier tails counteract the crowding of points in the map):

\[ q_{j|i} := \frac{(1 + || y_{i} - y_{j} ||^{2})^{-1}}{\sum_{k \neq i} (1 + || y_{i} - y_{k} ||^{2})^{-1}}, \]

where \(y_{i}\) denotes the low-dimensional coordinates of point \(i\).
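The low-dimensional side can be sketched as below, assuming hypothetical randomly initialized 2-D coordinates Y (the t-kernel replaces the Gaussian used in the high-dim space, and no bandwidth parameter is needed):

```python
import numpy as np

rng = np.random.default_rng(42)
Y = rng.normal(scale=1e-2, size=(5, 2))  # random low-dim init (hypothetical)

# Student-t kernel with one degree of freedom: (1 + ||y_i - y_j||^2)^(-1)
D2 = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
Q = 1.0 / (1.0 + D2)
np.fill_diagonal(Q, 0.0)               # q_{i|i} = 0 by definition
Q /= Q.sum(axis=1, keepdims=True)      # normalize each row to a distribution
```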

Step 4 - optimization (training like a supervised model)

Now, for each point we have two distributions: in low-dim space (Q) and high-dim space (P).

P and Q distributions for all points
Point A Point B Point C Point D Point E
P_A 0.000 0.302 0.282 0.189 0.226
Q_A 0.000 0.275 0.217 0.252 0.257
P_B 0.285 0.000 0.250 0.215 0.251
Q_B 0.254 0.000 0.232 0.256 0.258
P_C 0.292 0.275 0.000 0.204 0.229
Q_C 0.216 0.249 0.000 0.268 0.267
P_D 0.204 0.246 0.213 0.000 0.337
Q_D 0.232 0.256 0.250 0.000 0.262
P_E 0.221 0.260 0.215 0.305 0.000
Q_E 0.236 0.256 0.247 0.261 0.000

Now, the idea is to change the coordinates in the low-dim space so that the \(Q_{i}\) distributions become similar to the \(P_{i}\) ones; this way the local structure of the high-dim space is reflected in the low-dim space. Effectively, we tackle a parameter-optimization problem (just like in a supervised model):

  • Parameters to be trained: coordinates of points in low-dimensional space
  • Actual values: P-distributions
  • Optimizer: stochastic gradient descent
  • Loss function: Kullback–Leibler divergence:

\[ D_{KL} (P || Q) = \sum_{x \in X} P(x) \cdot log_{2}\frac{P(x)}{Q(x)} \]

or, using the previous notations (for point A):

\[ D_{KL} (P || Q) = \sum_{j \in \{B, C, D, E \}} p_{j|A} \cdot log_{2}\frac{p_{j|A}}{q_{j|A}} \]

For all points in the dataset:

\[ D_{KL} (P || Q) = \sum_{i \in \{A, B, C, D, E \}} \sum_{j \in \{A, B, C, D, E \}} p_{j|i} \cdot log_{2}\frac{p_{j|i}}{q_{j|i}} \]
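As a quick numeric check, we can plug the tabulated values for point “A” into the per-point formula (the numbers are read off the P/Q table above):

```python
import numpy as np

# P_A and Q_A for point "A" from the table above (entries for B, C, D, E)
p_A = np.array([0.302, 0.282, 0.189, 0.226])
q_A = np.array([0.275, 0.217, 0.252, 0.257])

# KL divergence in bits: sum_j p_{j|A} * log2(p_{j|A} / q_{j|A})
kl_A = (p_A * np.log2(p_A / q_A)).sum()
print(round(kl_A, 4))
```

The result is a small positive number: the randomly initialized Q_A is already fairly close to P_A, but there is still divergence for the optimizer to reduce.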

In practice, we can also define the symmetrized probabilities \(p_{ij}:=\frac{p_{j|i} + p_{i|j}}{2n}\) (where \(n\) is the number of datapoints) and change the loss function to the following form:

\[ D_{KL} (P || Q) = \sum_{i \in \{A, B, C, D, E \}} \sum_{j \in \{A, B, C, D, E \}} p_{ij} \cdot log_{2}\frac{p_{ij}}{q_{ij}}, \]

where \(p_{ij} = p_{ji}\) and \(q_{ij} = q_{ji}\).
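The whole optimization can be sketched as plain gradient descent on the symmetrized objective (a minimal sketch: the conditional P table comes from Step 2, the gradient is the standard t-SNE gradient for natural-log KL, and the learning rate and iteration count are arbitrary choices; real implementations add momentum and early exaggeration):

```python
import numpy as np

# Conditional probabilities P_i from Step 2 (rows A-E)
P_cond = np.array([
    [0.000, 0.302, 0.282, 0.189, 0.226],
    [0.285, 0.000, 0.250, 0.215, 0.251],
    [0.292, 0.275, 0.000, 0.204, 0.229],
    [0.204, 0.246, 0.213, 0.000, 0.337],
    [0.221, 0.260, 0.215, 0.305, 0.000],
])
n = P_cond.shape[0]
P = (P_cond + P_cond.T) / (2 * n)        # symmetrized p_ij

rng = np.random.default_rng(0)
Y = rng.normal(scale=1e-2, size=(n, 2))  # random low-dim init
lr = 1.0

kl_history = []
for _ in range(500):
    diff = Y[:, None, :] - Y[None, :, :]             # y_i - y_j
    W = 1.0 / (1.0 + (diff ** 2).sum(axis=-1))       # t-kernel weights
    np.fill_diagonal(W, 0.0)
    Q = W / W.sum()                                  # joint q_ij
    mask = P > 0
    kl_history.append((P[mask] * np.log(P[mask] / Q[mask])).sum())
    # t-SNE gradient: dC/dy_i = 4 * sum_j (p_ij - q_ij) * w_ij * (y_i - y_j)
    grad = 4.0 * np.einsum('ij,ijk->ik', (P - Q) * W, diff)
    Y -= lr * grad                                   # plain gradient descent
```

After the loop, `Y` holds the trained 2-D coordinates and `kl_history` shows the loss shrinking over the iterations.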

Final output

For technical reasons, the original dataset was extended by adding some noise.

The results in 2-dim space are the following:

Examples

Iris

Iris dataset
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
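In practice one would not implement t-SNE by hand; for example, with scikit-learn (assuming it is installed) the Iris dataset can be embedded into 2 dimensions as follows:

```python
from sklearn.datasets import load_iris
from sklearn.manifold import TSNE

iris = load_iris()
X, y = iris.data, iris.target          # 150 samples, 4 features, 3 species

# Embed the 4-dimensional measurements into 2 dimensions
embedding = TSNE(n_components=2, random_state=42).fit_transform(X)
print(embedding.shape)  # (150, 2)
```

The `random_state` pins the otherwise non-deterministic result; plotting `embedding` colored by `y` shows the three species as separate clusters.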

MNIST

Appendix